Large-scale whole exome sequencing studies identify two genes,CTSL and APOE, associated with lung cancer

Common genetic variants associated with lung cancer have been well studied in the past decade. However, only 12.3% heritability has been explained by these variants. In this study, we investigate the contribution of rare variants (RVs) (minor allele frequency <0.01) to lung cancer through two large whole exome sequencing case-control studies. We first performed gene-based association tests using a novel Bayes Factor statistic in the International Lung Cancer Consortium, the discovery study (European, 1042 cases vs. 881 controls). The top genes identified are further assessed in the UK Biobank (European, 630 cases vs. 172 864 controls), the replication study. After controlling for the false discovery rate, we found two genes, CTSL and APOE, significantly associated with lung cancer in both studies. Single variant tests in UK Biobank identified 4 RVs (3 missense variants) in CTSL and 2 RVs (1 missense variant) in APOE stongly associated with lung cancer (OR between 2.0 and 139.0). The role of these genetic variants in the regulation of CTSL or APOE expression remains unclear. If such a role is established, this could have important therapeutic implications for lung cancer patients.


Introduction
Lung cancer (LC) is the most commonly diagnosed cancer in men and the third most commonly occurring cancer in women worldwide as estimated in 2018 [1], with an estimated 2.3 millions new cancers diagnosed annually.It is the leading cause of cancer death worldwide with 1.8 million annual deaths accounting for 18% of all cancer deaths [1].Although reduction of tobacco consumption remains the most appropriate strategy to reduce LC burden, only 10%-15% of all smokers eventually develop LC [2][3][4].In Asian countries, up to 30%-40% of lung cancer cases occur in never smokers [4], which suggests a possible role of genetic factors among others.
Common genetic variants associated with LC have been identified through large, collaborative, genome-wide association studies (GWASs), including susceptibility loci at CHRNA3/5, TERT, HLA, BRCA2, CHEK2 [5,6].Yet, they explained only about 12.3% of LC heritability reported in a recent GWAS [7].It is therefore hypothesized that some of the unexplained variability might be due to rare variants (RVs) [8].A recent study was able to identify 48 germline RVs with deleterious effects on LC in known candidate genes such as BRCA2 in a sample of 260 case patients with the disease and 318 controls [9].More recently, Liu et al. [10] identified 25 deleterious RVs associated with LC susceptibility, including 13 reported in ClinVar.Of the five validated candidates, the authors identified two pathogenic variants in known LC susceptibility loci, ATM p.V2716A (Odds Ratio 19.55, 95%CI [5.04,75.6])and MPZL2 p.I24M frameshift deletion (Odds Ratio 3.88, 95%CI [1.71,8.8]);and three in novel LC susceptibility genes including POMC, STAU2 and MLNR.
To improve the detection of RVs in sequencing studies, we recently proposed a gene-based test for case-control study designs using a Bayes Factors (BF) statistic [11], comparing the total RV counts between cases and controls.Informative priors can be included in this setting, making the BF also sensitive to allelic distribution differences at single variant sites between cases and controls.To elucidate the inherited germline RVs associated with LC, we applied our novel BF approach to whole exome sequencing (WES) data from the International Lung Cancer consortium (ILCCO) [10], with the goal to identify new genes associated with LC specifically focused on RVs as well as potential causal variants within these genes.Independent replication of the most promising genes and RVs was performed in the UK Biobank data [12].

Ethics statement
All participants provided written informed consent, and the study was reviewed and approved by institutional ethic committee of each study site including HSPH-MGH, University Health Network and Mount Sinai Hospital in Toronto (Toronto), University of Liverpool in UK (Liverpool) and IARC.

Study population for gene-based and RV discovery
Case patients with LC and matched healthy individuals were identified from four independent case series that form the ILCCO consortium, including Harvard University School of Public Health/Massachusetts General Hospital (HSPH-MGH), University Health Network and Mount Sinai Hospital in Toronto (Toronto), University of Liverpool in UK (Liverpool) and the International Agency for Research on Cancer (IARC).The original data includes 2047 samples, of which 44 are HapMap controls and 68 were flagged by the Center for Inherited Disease Research (CIDR) as duplicates, related individuals or quality control outliers.Whole exome sequencing was performed for selected LC cases and frequency-matched unaffected controls, to identify novel common and rare genetic variants associated with LC risk.To enrich the relevance of genetics in the cases, LC patients were preferentially selected from those with a family history of LC among first-degree relative or early-onset (<60 years).About the same number of controls were selected, frequency-matched by age and sex with the cases.To adjust for population stratification, principal components (PCs) were derived from the genome-wide data from the ILCCO.The analysis was restricted to those with European ancestry.The representation of the top 3 PCs (S1 Fig) identified one outlier participant with possible non-European ancestry, and was removed from the analysis.We further removed 10 individuals with genotype missing rate >10% and one individual was flagged with very low heterozygosity rate (> 6 standard deviations below the mean heterozygosity).After the filtering steps, a total of 1923 subjects remained in the study and were included in the analyses.

Study population for gene-based and RV replication
We used UK Biobank WES data as the validation set [13,14].Among the total number of 200,643 samples, our analysis includes all LC patients after excluding those diagnosed at most 5 years before any other primary cancers and controls with no cancer diagnosis history.We also removed at random one individual from each pair of individuals closer than 3 rd degree relatives (kinship coefficient > 0.0884), and subjects who self-reported a non-white ethnic background.After the filtering, 173,494 individuals remained in the study.

Germline Sequencing/QC
ILCCO.The sequencing of whole exomes and additional targeted regions of DNA samples from all 4 different sites was performed at the CIDR.Targeted regions were selected based on previous associations with LC or with histological LC subtypes from GWASs on common variants [5,6].After initial quality control (QC) analysis by CIDR [10], the mean on-target coverage was 52X and more than 97% of targeted bases had a depth greater than 10X.Further QC analysis was performed including the following steps: i) Exclusion of variants with QUAL<100 indicating a low probability that there is a variant at a site or mean GQ<50 indicating low probabilities that genotype calls were correct across individuals at a site so that Ts/ Tv ratio is greater than 2 (S2 Fig) ; ii) Exclusion of singleton variants (variant with occurrence of only 1 minor allele) when minor allele has GQ<50 or depth <20; iii) Exclusion of non-biallelic variants and variants on the sex chromosome; iv) Exclusion of variants with p-value of Hardy-Weinberg equilibrium test <1e-7 in the control samples; v) Set individual genotype as missing if GQ<30 or depth<10; vi) Exclusion of variants with minor allele frequency (MAF)> 1% (MAF was estimated using study population).The MAF distribution of the remaining RVs is given in Table 1.UK Biobank.We performed the following QC steps for all genes selected in the discovery set: i) exclude variants that are not bi-allelic and those with QUAL<10; ii) filter out variants with mean GQ<30 as well as singleton variants with depth <20 or GQ<40; iii) set genotype missing if depth<10 or GQ<20, and exclude variants with missing genotype rate >10%; iv) exclude variants with MAF>1% (MAF estimated using study population).
In both the discovery and replication studies, for our gene-based analyses, we considered -/+ 1k bp up-and down-stream sites of each gene (including non-exonic RVs) for the analysis.
Gene-based analysis.To increase the power of discovering genes associated with LC, we applied a gene-based approach based on a Bayes Factor (BF) statistic that we recently developed, to both the discovery and replication studies [11].It was designed specifically to test the association between a set of RVs located in the same region or in a gene and a disease outcome in the context of case-control designs.An advantage of our BF approach over existing methods is the possibility to introduce an "informative" prior to gain power to detect gene-based associations, where this prior is sensitive to allelic differences between cases and controls for a particular gene (S1 Text).Compared to the commonly-used SKAT gene-based test [15], our BF approach is more sensitive to an excess of small p-values from single RV tests within each gene while SKAT has better power to detect genes exhibiting systematic allelic differences between cases and controls across all RVs.This difference was discussed in details in [11] and illustrated on two genes that showed large discrepancy in overall ranking when applying these two approaches [11].In this study, we applied two versions of the BF test statistic, BF KS and BF SKAT , where either a Kolmogorov-Smirnov (KS) or SKAT p-value is used as informative prior.This gave us higher chance to detect genes that may have different underlying RV allelic distribution differences between cases and controls.The respective advantage of each approach is described in details in the S1 Text.In this paper, we mainly focused on BF KS and used BF SKAT as a secondary analysis.
To assess the sensitivity of the association tests on confounding variables, we conducted sensitivity analyses on the genome-wide significant genes and adjusted our analyses for age, sex, smoking and the top 5 PCs used to control for population stratification.Both the BF and the prior components (KS or SKAT p-value) were adjusted.The extention of BF KS and BF SKAT incorporating covariates is described in S1 Text.
Single RV-based analysis.For the two genes that passed a gene-based replication genome-wide significance level (see below), i.e., APOE and CTSL, we performed single RV tests only with UK Biobank since this study has larger coverage of RVs.We used the Firth's bias-reduced logistic regression to deal with sparse allelic counts [16].Analyses were adjusted for age, sex, smoking status (ever vs. never smoking) and the top five PCs.RVs that pass a FDR adjusted q value [17] of 0.01 were selected.
Significance threshold for gene-based replication analysis.We denote P d the P value for selecting genes in the discovery cohort (ILCCO) and P r the P value for selecting a gene in the replication cohort (UK Biobank).We set γ as the significance threshold for selecting genes in the discovery cohort and which will be followed-up for replication in UK biobank and λ the significance level in the replication cohort.To control the gene-based family-wise error rate where V is number of genes declared achieved signficiance levels in both discovery and validation studies, P d �γ and P r �λ, where γ and λ were determined through permutation analysis, as follows.First, we repeated analyses of ILCCO (discovery set) and UK Biobank (validation set) studies 100 times, where each time the phenotype of individuals was permuted.Second, we determined the two thresholds such that among 100 replicates, the number of identified significant genes is less or equal to 100×α = 100×0.05= 5, for a genome-wide control of FWER�5%.We found the following thresholds, γ = 5×10 −4 and λ = 0.05 in the discovery and validation study, respectively, when using BF KS as the test statistic (i.e., our main statistic).Therefore, in our application analysis, the set of genes that passed a significance threshold of γ = 5×10 −4 in the discovery (ILCCO) cohort and λ = 0.05 in the replication (UK Biobank) cohort were declared associated with the disease and replicated.

Gene-Based analysis
In the discovery study, a total of 13,872 genes with at least 20 bi-allelic RVs were analyzed based on the QC pipeline described.The QQ plots corresponding to 2log(BF KS ) and 2log  (3).Using a significance level of γ = 5×10 −4 in the discovery cohort (see Methods section), a total of 17 genes based on BF KS and 14 genes using BF SKAT (Tables 3 and 4) were selected for replication.The 2 top genes are CTSL (P = 4.9×10 −5 ) and TBX4 (P = 6.5×10 −5 ) with BF KS , VAV2 (P = 1.9×10 −5 ) and DENND4B (P = 4.3×10 −5 ) with BF SKAT .Four genes are found by both test statistics including CTSL, TBX4, C8orf44, and DGKB.Using a significance level of λ = 0.05 (see Methods section) in the replication study, we were able to replicate only one gene, CTSL (P = 2.7×10 −3 ), when using the BF KS test and the two genes APOE (P = 1.9×10 −3 ) and CTSL (P = 6.9×10 −6 ) based on the BF SKAT test (Tables 3 and 4).For each gene identified in the discovery set, we calculated an overall p-value in Tables 3 and 4 by combining p-values from the discovery and validation sets using Fisher's method [18].

Sensitivity analysis
We found that the association signal for CTSL did not change much after adjustment for confounders using BF KS

Single RV-based analysis
In UK Biobank, a total of 155 bi-allelic RVs for CTSL and 174 for APOE were included in the analysis.In CTSL, 4 RVs were found associated with LC at an FDR q-value of 0.01, including  variant at positions 87728433 (rs771328780), 87729621 (rs778002071), 87730426 (rs777251059) and 87727608 (rs112682750) on chromosome 9 (Table 5), where the last 3 were missense variants.In APOE, 2 RVs passed this significance level, including variant at position 44907893 (rs number not available) and 44906640 (rs1568615382) on chromosome 19.Most of the variants found to be associated with LC risk are very rare (MAF<10 −4 in controls), except one missense variant in CTSL, rs112682750, has a MAF of 7.7×10 −3 .All the 6 RVs are associated with increased LC risk as indicated by an odds-ratio>1 in UK Biobank.One of the 6 RVs was present in ILCCO, rs112682750 in CTSL, but it did not show association with LC after adjustment for age, sex, smoking and PCs (P = 0.19).

Genomic region analysis of rs112682750 in CTSL
Using cancer cell lines from the USCS genome browser, a genomic analysis of the region around rs112682750 indicates that this variant is located within a promoter/enhancer region of CTSL in lung related cells (S3 Fig).This suggests that rs112682750 might affect the transcription of CTSL.

Annotation of Single RVs in CTSL and APOE
We searched functional annotation for the 6 associated RVs identified from CTSL and APOE using Ensembl Variant Effect Predictor (VEP) [20], Combined Annotation Dependent Depletion (CADD) [21,22] and Functional Annotation of Variants-Online Resource (FAVOR) [23].The search results indicated that rs778002071 (CTSL) was categorized as deleterious nonsynonymous variant, according to all three annotation resources, and the rest 5 RVs were predicted to be tolerated (benign) by at least one resource (Table 6).

Discussion
By focusing on rare variants using whole exome sequencing data, we identified two new genes, CTSL and APOE, associated with LC in the ILCCO study, that were replicated in the UK Based on the Firth biased-corrected logistic regression [16] b Only RVs with a q-value � 0.01 were selected.https://doi.org/10.1371/journal.pgen.1010902.t005 Biobank study.In CTSL, 3 missense RVs and 1 RV with unknown significance were discovered as associated with LC in the UK Biobank study.In APOE, 1 missense variant and 1 with unknown significance were discovered.
The Cathepsin L gene (CTSL), is a ubiquitously expressed lysosomal endopeptidase that is primarily involved in terminal degradation of intracellular and endocytosed proteins [25].CTSL has recently gained attentions for its roles in SARS-CoV2 entry to host cell by cleaving receptor-bound viral spike protein, which results in further activation and infection [26,27].While potential functional connection between viral infection and lung cancer susceptibility remains to be established, CTSL also has roles relevant in tumorigenesis and progression.CTSL upregulation has been reported in a wide range of human malignancies including ovarian, breast, prostate, lung, gastric, pancreatic and colon cancers [28].Importantly, evidence indicates that CTSL expression may be linked to cancer grade and stage.In LC patients, higher CTSL activity has been reported compared to non-malignant tissue as well as association between tumor grade and upregulated serum levels [29].The role of CTSL in promoting tumor progression and metastatic aggressiveness has also been suggested [30].Significant interest in the development of CTSL intervention strategies has also emerged.For example, CTSL downregulation through RNA interference in different tumor models (including glioma, osteosarcoma, myeloma and melanoma) resulted in consistent inhibition of tumorigenicity and invasiveness of neoplastic cells [31][32][33][34].The identification of patients who might benefit from anti-CTSL therapy remains an important clinical question.The identification of new RVs that correlate with LC risk in our study could therefore help identify these patients.Although the impacts of these variants to CTSL levels or activity in early vs. late stages of lung tumorigenesis need to be established, potential regulatory function of the most common variant we identified in CTSL, rs112682750, for instance, could be hypothesized.The apolipoprotein E gene (APOE) codes for a protein associated with lipid particles, that mainly functions in lipoprotein-mediated lipid transport between organs via the plasma and interstitial fluids.APOE is also associated with atherosclerogenesis, which itself has been involved in tumor development.APOE has been shown to act as a growth factor that can influence carcinogenesis [35].In patients with LC, the levels of APOE gene expression were significantly higher in cancer tissue than in adjacent non-cancer tissue [36].Serum APOE has also been associated with lymph node metastasis in lung adenocarcinoma patients [37].It was also reported that high expression of APOE promotes cancer cell proliferation and migration and contributes to an aggressive clinical course in patients with lung adenocarcinoma [38].APOE has also raised interest for therapeutic interventions.For instance, APOE was involved in the inhibition of melanoma metastasis and angiogenesis by stimulating the immune response to tumor cells [39].Identification of genetic variants that could regulate APOE expression could therefore have important therapeutic implications.Of note, APOE was only detected with one version of our BF approach (i.e., BF SKAT ) and further validation of this gene is warranted.
The strengths of our study include the large sample sizes available for discovery and replication of the gene-based analyses and the use of UK Biobank data for RV discoveries.Our statistical approach for gene discovery, the Bayes Factor statistic, has also been shown to have increased power compared to competing approaches such as SKAT and the Burden test [11].Another significant advantage is its sensitivity to detect single RV associations through the definition of informative priors.Under our statistical framework, the discovery of RVs can therefore be thought as a two-step approach where the first step is a gene-based analysis and the second step, an RV association test within the set of significantly associated genes.
Our study contrasts with Liu et al.'s analysis of the ILLCO data [10] in several aspects.They performed single RV analyses focusing only on suspected deleterious variants.In a second step, they performed gene-based tests using only genes that included RVs that were significantly associated with LC after controlling for multiple comparisons from a Burden test.In comparison, we tested all the genes in the discovery cohort and did not make any assumption regarding the possible functional effect of the RVs.
The discovery of RVs in the context of sequencing studies remains a field of intensive research.
The limitations of this study include the need for further validation and characterization of the two genes and RVs identified, in particular to correlate them with disease progression outcomes and LC subtypes.Also, the benefit for therapeutic interventions may be considered as it could lead to a more personalized treatment of LC patients targeting specific gene/pathway mechanisms such as the immune response system.

Fig 1 .
Fig 1. QQ plot of ILCCO WES study.The departure of the right tail from the 45 degree line represents the association signals from the study.(A) illustrates results using BF with KS prior.Under the null hypothesis (no association between genes and phenotype), 2logBF ks ~χ2 (3).(B) shows results using BF with SKAT prior.Similarly, 2logBF SKAT ~χ2 (3) under the null hypothesis.https://doi.org/10.1371/journal.pgen.1010902.g001

Table 2 . Basic demographic characteristics in the discovery and validation studies. Discovery (ILCCO) Replication (UK Biobank)
SKAT ) statistics are presented in Fig 1 and confirm that they are both asymptotically distributed as χ 2 NS: not significant https://doi.org/10.1371/journal.pgen.1010902.t002(BF

Table 3 . Results of gene-based analyses using BF KS test a in the discovery and replication studies.
a. Bayes factor (BF) approach using Kolmogorov-Smirnov (KS) test as prior b.P value of KS test c.P value of BF with KS prior d.Genes with #(sites)<20 were excluded from BF test https://doi.org/10.1371/journal.pgen.1010902.t003

Table 4 . Results of gene-based analyses using BF SKAT a in the discovery and replication studies.
c. P value of BF with SKAT prior https://doi.org/10.1371/journal.pgen.1010902.t004

Table 5 . Results of single RV-based association analysis in the genes CTSL and APOE using UK Biobank data.
a